Using Classification Techniques to Determine Source Code Authorship
Abstract
Testing the authorship of source code is a useful capability. We present techniques that use statistical machine learning to accomplish this task. We translate source code into abstract syntax trees and then split the trees into functions. The tree for each function is treated as a document with a known author. This collection is fed to an SVM package using a kernel that operates on tree-structured data. The classifier is trained on source code from two authors and is then able to predict which of the two authors wrote a new function. We achieve between 67% and 88% classification accuracy over the set of programs we examined. This demonstrates that our techniques are useful either alone or in concert with other methods.
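The pipeline described in the abstract (parse to an abstract syntax tree, split into per-function trees, compare trees with a kernel) can be sketched roughly as follows. The abstract does not name the programming language analysed, the parser, the SVM package, or the specific tree kernel, so everything here is an assumption for illustration: Python's standard `ast` module stands in for the parser, and `tree_kernel` is a simple bag-of-productions similarity, not the kernel used by the authors. The helper names `function_trees`, `productions`, and `tree_kernel` are hypothetical.

```python
import ast
from collections import Counter


def function_trees(source):
    """Parse source code and return one AST subtree per function definition."""
    tree = ast.parse(source)
    return [node for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)]


def productions(tree):
    """Count 'productions': a node type paired with the types of its children."""
    counts = Counter()
    for node in ast.walk(tree):
        children = tuple(type(c).__name__ for c in ast.iter_child_nodes(node))
        counts[(type(node).__name__, children)] += 1
    return counts


def tree_kernel(a, b):
    """Assumed stand-in kernel: dot product of the two production-count vectors."""
    pa, pb = productions(a), productions(b)
    return sum(n * pb[p] for p, n in pa.items())


# Toy corpus: two structurally identical functions and one different one.
source = '''
def total(xs):
    s = 0
    for x in xs:
        s += x
    return s

def total2(ys):
    t = 0
    for y in ys:
        t += y
    return t

def first(xs):
    return xs[0]
'''

funcs = function_trees(source)  # each function tree is one "document"
# Gram matrix of pairwise similarities between function trees.
gram = [[tree_kernel(a, b) for b in funcs] for a in funcs]
```

A Gram matrix like `gram`, together with one author label per function, is the kind of input an SVM implementation that accepts a precomputed kernel could be trained on. Because this sketch ignores identifier names, `total` and `total2` are indistinguishable to it; a real attribution system would have to decide whether such naming choices count as author style.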
Similar resources
Application of Information Retrieval Techniques for Source Code Authorship Attribution
Authorship attribution assigns works of contentious authorship to their rightful owners solving cases of theft, plagiarism and authorship disputes in academia and industry. In this paper we investigate the application of information retrieval techniques to attribution of authorship of C source code. In particular, we explore novel methods for converting C code into documents suitable for retrie...
Comparing techniques for authorship attribution of source code
Attributing authorship of documents with unknown creators has been studied extensively for natural language text such as essays and literature, but less so for non-natural languages such as computer source code. Previous attempts at attributing authorship of source code can be categorised by two attributes: the software features used for the classification, either strings of n tokens/bytes (n-g...
Poster: Source Code Authorship Attribution
As information becomes widely available and easily accessible through the Internet and other sources, the trend of plagiarism has been increasing. Plagiarism and copyright infringement are issues that come up in both academic and corporate environments. We need author classification techniques to inhibit such unethical violations. Source code is also intellectual property and reflects individua...
Source Code Authorship Attribution Using Long Short-Term Memory Based Networks
Machine learning approaches to source code authorship attribution attempt to find statistical regularities in human-generated source code that can identify the author or authors of that code. This has applications in plagiarism detection, intellectual property infringement, and post-incident forensics in computer security. The introduction of features derived from the Abstract Syntax Tree (AST)...
Source Code Authorship Analysis for Supporting the Cybercrime Investigation Process
Cybercrime has increased in severity and frequency in the recent years and because of this, it has become a major concern for companies, universities and organizations. The anonymity offered by the Internet has made the task of tracing criminal identity difficult. One study field that has contributed in tracing criminals is authorship analysis on e-mails, messages and programs. This paper conta...
Publication date: 2006